# A Detailed Review on Lifting Based DWT Architectures for Image Compression

Sudhir Kumar<sup>1</sup>, Sachin Bandewar<sup>2</sup>

<sup>1</sup>Mtech Scholar, <sup>2</sup>Assistant Professor, Department of EC, SSSCE, RKDF University, Bhopal, M.P, India <sup>1</sup>sudhirkumar6870@gmail.com, <sup>2</sup>sachin.bandewar9@gmail.com

Abstract: The aim of this paper is to give a review of VLSI architectures for 2-DWT implementation of wavelet transform by using 4 Booth multiplier. This review paper describes implementation of radix-4 Modified Booth Multiplier and this implementation is compared with Radix-2 Booth Multiplier. Modified Booth's algorithm employs both addition and subtraction and also treats positive and negative operands uniformly. Parallel MAC is frequently used in digital signal processing and video/graphics applications. A new architecture of multiplier - and accumulator (MAC) for high speed arithmetic by combining multiplication with accumulation and devising a carry-look ahead adder (CLA), the performance is improved. Modified Booth multiplication algorithm is designed using high speed adder. High speed adder is used to speed up the operation of Multiplication. Designing of this algorithm is done by using VHDL and simulated using Modelsim SE 16.3f software has been used and implemented on Matlab R2013b. This paper proposes the design and implementation of Booth multiplier using VHDL. This compares the power consumption and delay of radix 2 and modified radix 4 Booth multipliers. We can achieve the Experimental results demonstrate that the modified radix 4 Booth multiplier has 22.9% power reduction than the conventional radix 2 Booth Multiplier and near about 50% power reduction than the conventional Normal Booth Multiplier

**Keywords**: Discrete Wavelet Transform (DWT), VLSI architectures, image compression. VLSI, Carry Select Adder(CSA), Carry Look Ahead Adder (CLA), ASM

## I. INTRODUCTION

The Discrete Wavelet Transform (DWT) plays a major role in the fields of signal analysis, computer vision, object recognition, image compression and video compression standard. The advantage of DWT over other traditional transformations is that it performs multi resolution analysis of signals with localization both in time and frequency. At present, many VLSI architectures for the 2-D DWT have been proposed to meet the requirements of real-time processing. The implementation of DWT in practical system has issues. First, the complexity of wavelet transform is several times higher than that of DCT. Second, DWT needs extra memory for storing the intermediate computational results. Moreover, for real time image compression, DWT has to process massive amounts of data at high speeds. The use of software implementation of DWT image compression provides flexibility for manipulation but it may not meet timing constraints in certain applications. Hardware implementation of DWT has practical obstacles. First, is that the high cost of hardware implementation of multipliers. Such implementations require both large number of arithmetic computations and storage, which are not desirable for either high speed or low power image / video processing. Therefore a new approach called the lifting scheme based wavelet transform was proposed by Matlab. based on a spatial construction of the second generation. The lifting scheme has many advantages over the previous approaches. In particular, all the interesting properties of wavelets, coefficients. As a consequence, it is easier to design wavelet filters. Unlike convolutional

hardware utilization and capable of handling multiple data streams for the calculation of 2D DWT. This paper focuses to give brief survey on 2D DWT hardware architectures with their implementation VLSI structures and computational complexities. In various one-dimensional lifting-based DWT architectures suitable for VLSI implementation and comparison of the hardware and timing complexities of all the architecture, gives the memory requirement for 2-dimensional DWT architectures, followed by representative architectures and a comparison of their hardware and timing complexities with the possibility of extending to multilevel input signals[7].



Fig1. Block diagram flow of a traditional 2-D DWT [2].

# II. NORMAL BOOTH MULTIPLIER

In many real-time DSP applications, high performance is a critical concern. Multiplication is the most fundamental arithmetic operation used in most of the signal processing algorithms. But multipliers usually have large area, larger delay and consume more power. However, achieving this may be done at the cost of area, on chip power consumed and delays. In the binary number system the digits, called bits, are limited to the set  $\{0, 1\}$ . The result of multiplying any binary number by a binary bit is either 0, or the original number. This makes formation of the intermediate partial-products simple and efficient. Adding all these partial-products is time

consuming task for any binary multipliers. The entire process consists of three steps partial product generation, partial product reduction and addition of partial products as shown in Fig 1. But in booth multiplication, partial product generation is done based on recoding scheme e.g. radix 2 encoding. Bits of multiplicand (Y) are grouped from left to right and corresponding operation on multiplier (X) is done in order generate the partial product. In radix-2 booth multiplication partial product generation is done based on encoding which is as given by Table1. Parallel



Fig 2. Flow chart for normal Booth Multiplier[16]

Multiplication using normal Booth's recoding algorithm technique based on the fact that partial product can be generated for group of consecutive 0"s and 1"s which is called Booth's recoding. This recoding algorithm is used to generate efficient partial product. These partial products always have large number of bits than the input number of bits. This increase in the width of partial product usually depends upon the radix scheme used for recoding. So, these scheme uses less partial product generation which in turn provides low power and area but in the Normal Booth multiplier Ripple Carry Adder is used shown in Fig[19].



# Drawbacks of using Ripple Carry Adder:

1 It is not efficient when large numbers of bits are used.

2 Carry propagation delay increases linearly with bit length as next stage output is dependent on previous stage output.

Logic equations:-

$$C_i = x_i & y_i(1)$$
  
 $P_i = x_i \land y_i(2)$   
 $S_i = P_i \land C_i(3)$   
 $P_i = Partial\ product$   
 $C_i = Carry\ bit$ 

Recoding scheme used in radix-2 booth multiplier is show in the Table 1.

Table 1 Recoding Table for Booth Multiplier [16]

| Ļ | Qn | Qn+1 | Recoded Bits | Operation  |
|---|----|------|--------------|------------|
|   | 0  | 0    | 0            | Shift      |
|   | 0  | 1    | +1           | Add X      |
| П | 1  | 0    | d -1 \       | Subtract X |
|   | 1  | 1    | 0            | Shift      |

Hardware implementation of Booth Algorithm requires the register configuration. So named the Multiplier and Multiplicand as registers 'A', 'B' and 'Q' as AC, BR as shown in Complete flow chart 2. An extra flip flop is Q<sub>n+1</sub> is added to provide a double bit inspection of the multiplier. The complete fig 3 has shown condition basis implementation of the Radix-2 Booth Algorithm[16].

### 2. Radix 2-Booth Multiplier

The Booth algorithm was invented by A. D. Booth, forms the base of Signed number multiplication algorithms that are simple to implement at the hardware level, and that have the potential to speed up signed multiplication Considerably. Booth's algorithm is based upon recoding the multiplier, y, to a recoded, value, z, leaving the multiplicand, x, unchanged. In Booth recoding, each digit of the multiplier can assume negative as well as positive and zero values. There is a special notation, called signed digit (SD) encoding, to express these signed digits. In SD encoding +1 and 0 are expressed as 1 and 0, but -1 is expressed as 1 (Vincent P. Heuring, 2003). The value of a 2s complement integer was defined a by equation 1.

$$y = -y_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} y_i 2^i$$
 (1)

This equation says that in order to get the value of a signed 2's complement number, multiply the  $m-i^{th}$  digit by -2'-1, and multiply each remaining digit i by +2g. For example, -7, which is 1001 in 2's complement notation, would be, in SD notation, 1001 = -8 + 0 + 0 + 1 = -7. For implementing booth algorithm most important step is booth recoding. By booth recoding we can replace string of 1s by 0s. For example the value of strings of five 1s,  $11111 = 2^5-1 = 100001 = 32-1=31$ . Hence if this number were to be used as the multiplier in a multiplication, we could replace five additions by one addition and one subtraction.

The Booth recoding procedure, then, is as follows:

- 1. Working from LSB to MSB, replace each 0 digit of the original number with a 0 in the recoded number until a 1 is encountered
- 2. When a 1 is encountered, insert a 1 at that position in the recoded number, and skip over any succeeding I's until a 0 is encountered.
- 3. Replace that 0 with a 1 and continue. This algorithm is expressed in tabular form in Table 1, considering pairs of numbers,  $y_{i-1}$  and  $y_i$  and the recoded digit,  $z_i$ . as shown in Table 1.

**Table 2:** Booth recoding table for radix-2 [11].

| $y_i$ | $y_{i-1}$ | $z_{i-1}$ | Multiplier<br>Value | Situation     |
|-------|-----------|-----------|---------------------|---------------|
| 0     | 0         | 0         | 0                   | String of 0s  |
| 0     | 1         | 1         | +1                  | End of string |
|       |           |           | - 63                | of 1 s        |
| 1     | 0         | 1         | -1                  | Begin string  |
|       |           | 1         |                     | of 1s         |
| 1     | 1         | 0         | 0                   | string of 1s  |

### **Booth Algorithm:**

- 1. Add 0 to right of LSB of multiplier and look at rightmost of multiplier to make pairing of 2 bits from right to left and mark corresponding multiplier value as shown in fig. 1
- 2. 00 or11:do nothing.
- 3. 01: Marks the end of a string of 1s and add multiplicand to partial product (running sum)
- 4. 10: Marks the beginning of a string of 1s substract multiplicand from partial product.

one of the solution realizing high sped multipliers is to enhance parallelism which helps in decreasing the number of subsequent calculation stages. The original version of booth's multiplier (radix-2) had two drawbacks.

- 1. The number of add / subtract operation became varible and hence became inconvenient while designing Parallel multipliers.
- 2. The Algorithm becomes inefficient when there are isolated 1s.

These Problem are overcome by using Radix-4 Booth algorithm which can scan string of three. This booth multiplier technique is to increase speed by reducing the number of partial product by half [11].

## **Architecture of Radix-4 Booth Multiplier:**

Multipliers are the essential components in real time signal processing and also for all the multimedia applications. Many previous works were done in implementing high-speed multiplier to reduce power consumption. This is due to the increased demand for portable multimedia applications which require low power consumption as well as high speed operation. However low-power multipliers without any

consideration for high-speed are not the appropriate solutions of low-energy embedded signal processing for multimedia applications. Previously, a hybrid radix-4 modified Booth encoded (MBE) multiplier was proposed for low-power and high-speed operation. This multiplier architecture had separate radix-4. Both encoders were operated regardless of the input patterns, resulting in power and area overhead. Therefore although its power consumption was reduced compared to the radix-4 architecture, its critical path delay was considerably increased. As a result, its energy efficiency was not improved over conventional radix-4.

We propose a multiplier that computes the partial product and parallely performs the accumulation of the partial products. This multiplier operates on the Modified Booth Encoder algorithm and the Wallace tree in radix-4 mode. In the majority of the input cases, the radix-8 architecture is as fast as radix-4 architecture while consuming less power. However, in the remaining input cases, the radix-8 architecture is bottlenecked by the generation of the ±3B partial product term, which requires an additional carry propagation adding stage. Therefore, we use radix-4 multiplier, which is faster at its operation and the cost of power increases. The structure of the proposed multiplier is illustrated in figure 1[15].



Fig4: Architecture of Radix-4 multiplier [13].

## III. RADIX-4 MODIFIED BOOTH ALGORITHM

The modified Booth algorithm minimises the number of partial products by half. We used the modified Booth encoding (MBE) scheme. It is known as the most efficient Booth encoding and decoding scheme. To multiply, multiplicand 'X' by multiplier 'Y' using the modified Booth algorithm. First group the multiplier bits 'Y' by three bits and encoding into one of {-2, -1, 0, 1, 2}. Prior to convert the multiplier, a zero is appended into the Least Significant Bit (LSB) of the multiplier. Table I shows the rules to generate the encoded signals by MBE scheme and Fig. 2 (a) shows the corresponding logic diagram. The Booth decoder generates the partial products using the encoded signals[11].



Fig5. Basic arithmetic steps of multiplication and accumulation[9],[10].

1's. These problems are overcome by using modified Radix-4 Booth multiplication algorithm.

Booth algorithm which scans strings of three bits is given below:

1) Extend the sign bit 1 position if necessary to ensure that n

1) Extend the sign bit 1 position if necessary to ensure that n is even.

(ii) The algorithm becomes inefficient when there are isolated

- 2) Append a 0 to the right of the LSB of the multiplier.
- 3) According to the value of each vector, each Partial Product will be 0, +M,-M, +2M or -2M.

The negative values of B are made by taking the 2's complement and in this paper Carry-look-ahead (CLA) fast adders are used. The multiplication of M is done by shifting M by one bit to the left. Thus, in any case, in designing n-bit parallel multiplier, only n/2 partial products are produced.

The partial products are calculated according to the following rule.



where B is the multiplier

Consider example for radix 4:

| A                  |   |   | 01         | 00  | 01         |    |    | 17                 |
|--------------------|---|---|------------|-----|------------|----|----|--------------------|
| X                  | × |   | 11         | 01  | 11         |    |    | -9                 |
| Y                  |   |   | $0\bar{1}$ | 10  | $0\bar{1}$ |    |    | recoded multiplier |
|                    |   |   | -A         | +2A | -A         |    |    | operation          |
| Add -A             | + |   | 10         | 11  | 11         |    |    |                    |
| 2-bit Shift        |   | 1 | 11         | 10  | 11         | 11 |    |                    |
| $\mathrm{Add}\ 2A$ | + | 0 | 10         | 00  | 10         |    |    |                    |
|                    |   |   | 01         | 11  | 01         | 11 |    |                    |
| 2-bit Shift        |   |   | 00         | 01  | 11         | 01 | 11 |                    |
| Add -A             | + |   | 10         | 11  | 11         |    |    |                    |
|                    |   |   | 11         | 01  | 10         | 01 | 11 | -153               |



Table 3 : Radix 4 Booth Table [10]

| Select Line | Partial Products (Operation) |
|-------------|------------------------------|
| (Encoding)  |                              |
| 000         | Add 0                        |
| 001         | Add multiplicand             |
| 010         | Add multiplicand             |
| 011         | Add 2* multiplicand          |
| 100         | Subtract 2* multiplicand     |
| 101         | Subtract multiplicand        |
| 110         | Subtract multiplicand        |
| 111         | Subtract 0                   |

The recoding is done by appending one zero to the Least Significant Bit (LSB) and extending the Most Significant Bit (MSB) with the sign bit if necessary. Then the grouping of 3 bits from the LSB is done as shown in Fig 2. The obtained result is -1 -1 0 -2. This result is multiplied with the multiplier and the number of partial product is reduced [10].

One of the solutions of realizing high speed multipliers is to enhance parallelism which helps to decrease the number of subsequent calculation stages. The original version of the Booth algorithm (Radix-2) had two drawbacks. They are:

(i) the number of add subtract operations and the number ofshift operations become variable and become inconvenient in designing parallel multipliers.

# IV. VLSI ARCHITECTURE IMPLEMENTATION

The architecture of the proposed ECAT Booth multiplier is designed by using tree-based carry save reduction followed by parallel-prefix carry-propagate addition architecture. The whole architecture of the proposed ECAT Booth multiplier is shown in Fig.accumulator. In final adder both sum and carry is added to produce the 2N bits product.



Fig7. VLSI Architecture [10].

### V. ARCHITECTURE OF A MULTIPLIER

A multiplier can be divided into three operational steps:

- i. Radix-4 Booth algorithm in which a partial product is generated.
- ii. Carry save adder and Accumulator
- iii. The final addition in which the final multiplication result is produced by adding the sum and the carry



Fig 8: MAC Multiplier[9].

Generally if N-bit data of multiplicand 'X' is multiplied with N-bit multiplier 'Y' then it generates N- partial products. But if Radix-4 booth algorithm is used then number of partial products will be reduced to N/2. In addition, the signed multiplication based on 2's complement numbers is also possible.

$$X = -2^{N-1}x_{N-1} + \sum_{i=0}^{N-1} xi \ 2^{i}, \quad xi \in 0, 1$$

$$X \times Y = \sum_{i=0}^{\frac{N}{2}-1} di \ 2^{2i} \ Y$$
Where 
$$di = -2x_{2i+1} + x_{2i} + x_{2i-1}$$

$$P = X \times Y + Z = \sum_{i=0}^{N/2-1} di \ 2^i \ Y + \sum_{j=0}^{2N-1} zi \ 2^i.$$

In CSA, the sign extension is used in order to increase thebit density of the operands. Half adder is used to generate sum and carry in CSA. The generated carry is stored inaccumulator [9].

## VII. 2-D DISCRETE WAVELET TRANSFORM

The main challenges in the hardware architectures for 1-D DWT are the processing speed and the number of multipliers and adders while for 2-D DWT it is the memory issue that dominates the hardware cost and the architectural complexity. A 2-D DWT is a separable transform where 1-D wavelet transform is taken along the rows and then a 1-D wavelet

transform along the columns. The 2-D DWT operates by inserting array transposition between the two 1-D DWT. The rows of the array are processed first with only one level of decomposition. This essentially divides the array into two vertical halves, with the first half storing the average coefficients, while the second vertical half stores the detail coefficients. This process is repeated again with the columns, resulting in four sub-bands within the array defined by filter output as in three-level decomposition.

The LL sub-band represents an approximation of the original image, the LL1 sub-band can be considered as a 2:1 subsampled version of the original image. The other three subbands HL1,LH1, and HH1 contain higher frequency detail information. This process is repeated for as many levels of decomposition as desired. The JPEG2000 standard specifies five levels of decomposition, although three are usually considered acceptable in hardware. In order to extend the 1-D filter to compute 2-D-DWT in JPEG2000, two points have to be taken into account. Firstly, the 1-D DWT generates the control signal memory to compute 2-D DWT and manages the internal memory access. Secondly, we need to store temporary results generated by 2-Dcolumn filter. The amount of the external memory access and the area occupied by the embedded internal buffer are considered the most critical issues for the implementation of 2D-DWT. As the cache is used to reduce the main memory access in the general processor architectures, in similar way, the internal buffer is used to reduce the external memory access for 2D-DWT. However, the internal buffer would occupy much area and consumption.Three main architecture approaches were proposed in the literature with the aim to implement efficiently the 2D-DWT level by level, line-based and block based architectures. These architectures address this difficulty in different ways. A typical level-by-level architecture as in Fig.17 uses a single processing module that first processes the rows, and then the columns.

Intermediate values between row and column processing are stored in memory. Since this memory must be large enough to keep wavelet coefficients for the entire image, external memory is usually used. Access to the external memory is sometimes done in row-wise order, and sometimes in column-wise order, so high-bandwidth access modes cannot be used. 157 external memory access can become the performance bottleneck of the system for the given J level of decomposition [7].

## VIII. CONCLUSION

In this paper, we have trying to achieved the 2-dimensional Discrete Wavelet Transform image compression process by using the Radix 4-booth multiplier designed using VHDL and computational time for the DWT architectures using Matlab and Modelsim6.3f Software. This review paper is useful for explaining a new method of 2-D DWT architectures by using 4 Booth multiplier capable of compressing image suitable for application in image and video processing multimedia real time applications.

### IX. REFERENCES

- [1] Wei Zhang, *Member, IEEE*, Zhe Jiang, Zhiyu Gao, and Yanyan Liu, "An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform" *IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 59, NO. 3, MARCH 2012.*
- [2] Chih-Hsien Hsia, Member, IEEE, Jen-Shiun Chiang, Member, IEEE, and Jing-Ming Guo, Senior Member, IEEE "Memory-Efficient Hardware Architecture of 2-D Dual-Mode Lifting-Based Discrete Wavelet Transform" IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 4, APRIL 2013.
- [3] Jinook Song, Student Member, IEEE, and In-Cheol Park, Senior Member, IEEE, "Pipelined Discrete Wavelet Transform Architecture Scanning Dual Lines" *IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II:* EXPRESS BRIEFS, VOL. 56, NO. 12, DECEMBER 2009.
- [4] Yeong-Kang Lai, Member, IEEE, Lien-Fei Chen, Student Member, IEEE, and Yui-Chih Shih, "A High-Performance and Memory-Efficient VLSI Architecture with Parallel Scanning Method for 2-D Lifting Based Discrete Wavelet Transform" IEEE Transactions on Consumer Electronics, Vol. 55, No. 2, MAY 2009 Contributed Paper.
- [5] Basant Kumar Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE," Memory-Efficient High-Speed Convolution-based Generic Structure for Multilevel 2-D DWT" IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 2, FEBRUARY 2013.
- [6] Yusong Hu, Student Member, IEEE, and Ching Chuen Jong, Member, IEEE, "A Memory-Efficient High-Throughput Architecture for Lifting-Based Multi-Level 2-D DWT" IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 20, OCTOBER 15, 2013.
- [7] Usha Bhanu.N and Dr.A. Chilambuchelvan, "A Detailed Survey on VLSI Architectures for Lifting based DWT for efficient hardware implementation" *International Journal of VLSI design & Communication Systems (VLSICS) Vol.3, No.2, April 2012.*
- [8] S. Shafiulla Bashal, Syed. Jahangir Badashah," DESIGN AND IMPLEMENTATION OF RADIX-4 BASED HIGH SPEED MULTIPLIER FOR ALU'S USING MINIMAL PARTIAL PRODUCTS" International Journal of Advances in Engineering & Technology, July 2012. ©IJAET ISSN: 2231-1963.
- [9] G. Jaya Prada, N.C. Pant "Design and Verification of faster Multiplier", vol. 1 issue 3, pp. 683-686, ISSN: 2248-9622. Vol. 1, Issue 3, pp.683-686.
- [10] M. Gopinathan & D. Jessintha (2013) "An Error Compensated DCT Architecture with Booth Multiplier", *IJAEEE- ISSN: 2278-8948, Volume-2, Issue-4, 2013.*
- [11] Sukhmeet Kaur, Suman and Manpreet Signh Manna," Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)" *Advance in*

- Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 683-690.
- [12] K. Babulu, G. Parasuram, "FPGA Realization of Radix-4 Booth Multiplication Algorithm for High Speed Arithmetic Logics" (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (5), 2011, 2102-2107, ISSN:0975-9646.
- [13] Bodasingi Vijay Bhaskar, Valiveti Ravi Tejesvi, Reddi Surya Prakash Rao," Implementation of Radix-4 Multiplier with a Parallel MAC unit using MBE Algorithm" International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) Volume 1, Issue 5, July 2012, ISSN: 2278 1323.
- [14] P.Ramesh , A.Deepthi, K.Sowjanya," Implementation of Discrete Wavelet Transform on FPGA to Detect Electrical Power System Disturbances" *International Journal Of Engineering And Computer Science ISSN*:2319 7242 Volume 2 Issue 11 November, 2013 Page No. 3223-3227.
- [15] S. JAGADEESH, S. VENKATA CHARY, "Design of Parallel Multiplier—Accumulator Based on Radix-4 Modified Booth Algorithm with SPST" *International Journal Of Engineering Research And Applications (IJERA) ISSN:* 2248-9622 www.ijera.com Vol. 2, issue 5, September-October 2012, pp.425-431.
- [16] Sakshi Rajput, Priya sharma, Gitanjali and Garima, "High Speed and Reduced Power –Radix-2 Booth Multiplier" (IJCEM) International Journal of Computational Engineering & Management, Vol. 16 Issue 2, March 2013 ISSN (Online): 2230-7893.
- [17] Nishat Bano, "VLSI Design of Low Power Booth Multiplier" International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February -2012 I ISSN 2229-5518.